In this lab, we will apply nearest neighbors classification to the Endometrium vs. Uterus cancer data. For documentation see: http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification and http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
Let us start by setting up our environment, loading the data, and setting up our cross-validation.
In [ ]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
Question Load the data as in the previous lab.
In [ ]:
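# A possible solution (a sketch), assuming (as in the previous lab) a CSV file
# with one sample per row, gene-expression features, and a 'Tissue' label column.
# The file name and column names below are assumptions; adapt them to your setup.
import pandas as pd

df = pd.read_csv('data/small_Endometrium_Uterus.csv', index_col=0)  # hypothetical path
X = df.drop(columns=['Tissue']).values        # design matrix of gene-expression features
y = np.where(df['Tissue'] == 'Uterus', 1, 0)  # binary labels: 1 = Uterus, 0 = Endometrium
print(X.shape, y.shape)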
In [ ]:
# Set up a stratified 10-fold cross-validation
from sklearn import model_selection
skf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)
folds = list(skf.split(X, y))  # materialize the folds so we can reuse them
In [ ]:
# This is the cross-validation method with scaling we defined in the previous labs.
from sklearn import preprocessing
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X
          and return probability estimates
    cv_folds: list of (train_indices, test_indices) pairs
        Cross-validation folds.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # Restrict the data to the train/test folds
        Xtr = design_matrix[tr, :]
        ytr = labels[tr]
        Xte = design_matrix[te, :]
        # Scale the data: fit the scaler on the training fold only,
        # then apply it to both folds
        scaler = preprocessing.StandardScaler()
        Xtr = scaler.fit_transform(Xtr)
        Xte = scaler.transform(Xte)
        # Fit the classifier on the (scaled) training fold
        classifier.fit(Xtr, ytr)
        # Predict probabilities (of belonging to the +1 class) on the test fold
        yte_pred = classifier.predict_proba(Xte)
        pred[te] = yte_pred[:, 1]
    return pred
Question A nearest-neighbors classifier with k neighbors can be instantiated as:
clf = neighbors.KNeighborsClassifier(n_neighbors=k)
Cross-validate 15 nearest-neighbors classifiers, for odd values of k ranging from 1 to 29. Plot the cross-validated area under the ROC curve as a function of k.
Why are we not using even values for k?
In [ ]:
from sklearn import neighbors
from sklearn import metrics
aurocs = []
for k in range(1, 30, 2):  # odd values of k, from 1 to 29
    # Compute the vector ypred of cross-validated predictions
    # of a k-nearest-neighbors classifier.
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    ypred = cross_validate(X, y, clf, folds)
    fpr, tpr, thresholds = metrics.roc_curve(y, ypred, pos_label=1)
    aurocs.append(metrics.auc(fpr, tpr))
In [ ]:
plt.plot(range(1, 30, 2), aurocs, color='blue')
plt.xlabel('Number of nearest neighbors', fontsize=16)
plt.ylabel('Cross-validated AUC', fontsize=16)
plt.title('Nearest neighbors classification', fontsize=16)
Question Use 'model_selection.GridSearchCV' to set the optimal value of k automatically. On the previous plot, draw the area under the ROC curve you obtain as a horizontal line.
In [ ]:
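# A possible solution (a sketch): let GridSearchCV pick k on each training fold.
# Make sure scoring='roc_auc'; by default GridSearchCV optimizes the estimator's
# default score (accuracy for classifiers), not the AUC.
from sklearn import model_selection

param_grid = {'n_neighbors': list(range(1, 30, 2))}
clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),
                                   param_grid, scoring='roc_auc')
ypred = cross_validate(X, y, clf, folds)
fpr, tpr, thresholds = metrics.roc_curve(y, ypred, pos_label=1)
auc_opt = metrics.auc(fpr, tpr)

# Re-draw the k vs. AUC curve and add the grid-search AUC as a horizontal line.
plt.plot(range(1, 30, 2), aurocs, color='blue')
plt.axhline(y=auc_opt, color='orange', linestyle='--',
            label='optimized k (AUC = %.3f)' % auc_opt)
plt.xlabel('Number of nearest neighbors', fontsize=16)
plt.ylabel('Cross-validated AUC', fontsize=16)
plt.title('Nearest neighbors classification', fontsize=16)
plt.legend(loc='lower right')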
Comment If the area under the ROC curve is lower than you expected, check which score the grid search optimized (i.e. its scoring parameter); by default, GridSearchCV optimizes the classifier's accuracy, not the AUC.
Let us look at the optimal value of the parameter k returned for the last fold.
In [ ]:
print(clf.best_params_)
Question Modify cross_validate(design_matrix, labels, classifier, cv_folds) so that it accepts a GridSearchCV instance as its classifier and prints the best parameter(s) for each fold.
In [ ]:
def cross_validate_optimize(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn GridSearchCV object
        GridSearchCV instance; must have the following methods/attributes:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X
          and return probability estimates
        - best_params_ the best parameter dictionary
    cv_folds: list of (train_indices, test_indices) pairs
        Cross-validation folds.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # Restrict the data to the train/test folds and scale it
        scaler = preprocessing.StandardScaler()
        Xtr = scaler.fit_transform(design_matrix[tr, :])
        Xte = scaler.transform(design_matrix[te, :])
        # Fit the grid search on the training fold; this runs an inner
        # cross-validation to select the best parameters.
        classifier.fit(Xtr, labels[tr])
        print(classifier.best_params_)  # best parameter(s) for this fold
        # Predict probabilities (of belonging to the +1 class) on the test fold
        pred[te] = classifier.predict_proba(Xte)[:, 1]
    return pred
Question How many nearest neighbors were chosen for each fold? How stable is this value?
In [ ]:
from sklearn import model_selection
param_grid = {'n_neighbors': list(range(1, 30, 2))}
clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),
                                   param_grid, scoring='roc_auc')
ypred = cross_validate_optimize(X, y, clf, folds)
fpr, tpr, thresholds = metrics.roc_curve(y, ypred, pos_label=1)
Question How does the nearest-neighbors classifier compare to linear regression (regularized or not)? Plot the ROC curves.
In [ ]:
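# A possible solution (a sketch). We assume the linear model from the previous
# lab was a (regularized) logistic regression; substitute whichever linear
# classifier you used there. The kNN ROC reuses fpr, tpr from the previous cell.
from sklearn import linear_model

logreg = linear_model.LogisticRegression(C=1.0)  # C=1.0 is an arbitrary choice
ypred_lr = cross_validate(X, y, logreg, folds)
fpr_lr, tpr_lr, _ = metrics.roc_curve(y, ypred_lr, pos_label=1)

plt.plot(fpr, tpr, label='kNN (AUC = %.3f)' % metrics.auc(fpr, tpr))
plt.plot(fpr_lr, tpr_lr,
         label='logistic regression (AUC = %.3f)' % metrics.auc(fpr_lr, tpr_lr))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False positive rate', fontsize=16)
plt.ylabel('True positive rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc='lower right')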
In [ ]:
Question What distance was used to define nearest neighbors? What other distances can you use? How does this affect performance?
You can find the documentation for nearest neighbors regression here: http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-regression
In [ ]:
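# A sketch: by default, KNeighborsClassifier uses metric='minkowski' with p=2,
# i.e. the Euclidean distance. Other built-in metrics include 'manhattan' (p=1)
# and 'chebyshev'; compare their cross-validated AUCs. k=5 is an arbitrary choice.
for metric_name in ['euclidean', 'manhattan', 'chebyshev']:
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, metric=metric_name)
    ypred_m = cross_validate(X, y, knn, folds)
    fpr_m, tpr_m, _ = metrics.roc_curve(y, ypred_m, pos_label=1)
    print('%s: AUC = %.3f' % (metric_name, metrics.auc(fpr_m, tpr_m)))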